Goto

Collaborating Authors

 analytical model


DCO: Dynamic Cache Orchestration for LLM Accelerators through Predictive Management

Zhou, Zhongchun, Lai, Chengtao, Gu, Yuhang, Zhang, Wei

arXiv.org Artificial Intelligence

Abstract--The rapid adoption of large language models (LLMs) is pushing AI accelerators toward increasingly powerful and specialized designs. Instead of further complicating software development with deeply hierarchical scratchpad memories (SPMs) and their asynchronous management, we investigate the opposite point of the design spectrum: a multi-core AI accelerator equipped with a shared system-level cache and application-aware management policies, which keeps the programming effort modest. Our approach exploits dataflow information available in the software stack to guide cache replacement (including dead-block prediction), in concert with bypass decisions and mechanisms that alleviate cache thrashing. We assess the proposal using a cycle-accurate simulator and observe substantial performance gains (up to 1.80x speedup) compared with conventional cache architectures. In addition, we build and validate an analytical model that takes into account the actual overlapping behaviors to extend the measurement results of our policies to real-world larger-scale workloads. Experiment results show that when functioning together, our bypassing and thrashing mitigation strategies can handle scenarios both with and without inter-core data sharing and achieve remarkable speedups. Finally, we implement the design in RTL and the area of our design is 0.064mm Our findings explore the potential of the shared cache design to assist the development of future AI accelerator systems. ITH the advent of the artificial intelligence (AI) era, the demand for AI-tailored hardware has surged across various environments, from data centers to embedded systems. A preliminary version of this paper appeared in the proceedings of ICS 2024. Z. Zhou and C. Lai contributed equally to this work. Z. Zhou and C. Lai are with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: zzhouch@connect.ust.hk; Gu is with the School of Electronic Science and Engineering, Southeast University, Nanjing, Jiangsu, China W . Zhang (corresponding author) is with the Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: eeweiz@ust.hk). Personal use of this material is permitted. These accelerators span a broad spectrum, from power-efficient devices to those designed for high computational throughput [34]. AI accelerators, compared with Graphics Processing Units (GPUs), can be optimized for AI applications and tailored for specific scenarios, such as pre-defined neural network (NN) computation graphs, operator types, certain data precision, and given power budgets. Since they are often used in scenarios where the execution graph is known during compilation, they typically employ software-controlled scratchpad memories (SPMs) as the on-chip storage.




DOSA: Differentiable Model-Based One-Loop Search for DNN Accelerators

Hong, Charles, Huang, Qijing, Dinh, Grace, Subedar, Mahesh, Shao, Yakun Sophia

arXiv.org Artificial Intelligence

In the hardware design space exploration process, it is critical to optimize both hardware parameters and algorithm-to-hardware mappings. Previous work has largely approached this simultaneous optimization problem by separately exploring the hardware design space and the mapspace - both individually large and highly nonconvex spaces - independently. The resulting combinatorial explosion has created significant difficulties for optimizers. In this paper, we introduce DOSA, which consists of differentiable performance models and a gradient descent-based optimization technique to simultaneously explore both spaces and identify high-performing design points. Experimental results demonstrate that DOSA outperforms random search and Bayesian optimization by 2.80x and 12.59x, respectively, in improving DNN model energy-delay product, given a similar number of samples. We also demonstrate the modularity and flexibility of DOSA by augmenting our analytical model with a learned model, allowing us to optimize buffer sizes and mappings of a real DNN accelerator and attain a 1.82x improvement in energy-delay product.


Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling

Patwari, Rajeev, Sirasao, Ashish, Das, Devleena

arXiv.org Artificial Intelligence

Large language models (LLMs) have been increasingly deployed as local agents on personal devices with CPUs, NPUs and integrated GPUs. However, forecasting inference performance on devices with such heterogeneity remains challenging due to the dynamic compute and memory demands. Existing approaches rely on GPU benchmarking or machine learning-based latency predictors, which are often hardware-specific and lack generalizability. To this end, we introduce LIFE, a lightweight and modular analytical framework that is comprised of modular analytical model of operators, configurable to characterize LLM inference workloads in a hardware and dataset-agnostic manner. LIFE characterizes the influence of software and model optimizations, such as quantization, KV cache compression, LoRA adapters, chunked prefill, different attentions, and operator fusion, on performance metrics such as time-to-first-token (TTFT), time-per-output-token (TPOT) and tokens-per-second (TPS). LIFE enables performance forecasting using only hardware specifications, such as TOPS and memory bandwidth, without requiring extensive dataset benchmarking. We validate LIFE's forecasting with inference on AMD Ryzen CPUs, NPUs, iGPUs and NVIDIA V100 GPUs, with Llama2-7B variants, demonstrating the utility of LIFE in forecasting LLM performance through lens of system efficiency to enable efficient LLM deployment across different hardware platforms.


Machine Learning-Driven Compensation for Non-Ideal Channels in AWG-Based FBG Interrogator

Kazakov, Ivan A., Kulichenko, Iana V., Kovalev, Egor E., Treskova, Angelina A., Barma, Daria D., Malakhov, Kirill M., Oseledets, Ivan V., Shipulin, Arkady V.

arXiv.org Artificial Intelligence

We present an experimental study of a fiber Bragg grating (FBG) interrogator based on a silicon oxynitride (SiON) photonic integrated arrayed waveguide grating (AWG). While AWG-based interrogators are compact and scalable, their practical performance is limited by non-ideal spectral responses. To address this, two calibration strategies within a 2.4 nm spectral region were compared: (1) a segmented analytical model based on a sigmoid fitting function, and (2) a machine learning (ML)-based regression model. The analytical method achieves a root mean square error (RMSE) of 7.11 pm within the calibrated range, while the ML approach based on exponential regression achieves 3.17 pm. Moreover, the ML model demonstrates generalization across an extended 2.9 nm wavelength span, maintaining sub-5 pm accuracy without re-fitting. Residual and error distribution analyses further illustrate the trade-offs between the two approaches. ML-based calibration provides a robust, data-driven alternative to analytical methods, delivering enhanced accuracy for non-ideal channel responses, reduced manual calibration effort, and improved scalability across diverse FBG sensor configurations.


Robust Moment Identification for Nonlinear PDEs via a Neural ODE Approach

Chen, Shaoxuan, Yang, Su, Kevrekidis, Panayotis G., Zhu, Wei

arXiv.org Artificial Intelligence

There exist numerous nonlinear partial differential equations in dispersive, as well as in dissipative systems which are of broad applicability in a wide range of sp atio-temporally dependent physical and biological settings. For instance, on the dispersive si de, the Nonlinear Schr odinger (NLS) model [1, 2] has been argued to be of relevance to optical [3, 4 ] and atomic physics [5, 6, 7] to plasma [8, 9] as well as fluid research [10, 9] and even to bio logical applications such as the DNA denaturation [11, 12]. This work has also been central to the seminal contributions of S. Aubry, especially in connection to discrete solitons and br eathers [13, 14, 15]. In a similar vein, in dissipative, reaction-diffusion systems one of the central m odels has been the Fisher-KPP (FKPP) equation which was originally conceived in the context of sp atial spread of advantageous alleles, and has since been employed to model species invasion, popul ation dispersal, and ecological front propagation [16, 17]. The FKPP model has also constituted fo r almost a century now a hallmark of reaction-diffusion models with a wide range of application s to cancer modeling, wound healing, flame propagation and a diverse further host of applications up to this day [18, 19] While such classical PDE models have for decades now been at t he center of initially analytical and subsequently computational studies, over the past deca de or so, a new suite of data-driven tools and techniques has met with explosive growth offering a n ew avenue and a fresh perspective enabling unprecedented developments.


Design of Trimmed Helicoid Soft-Rigid Hybrid Robots

Patterson, Zach J., Sologuren, Emily R., Rus, Daniela

arXiv.org Artificial Intelligence

As soft robot design matures, researchers have converged to sophisticated design paradigms to enable the development of more suitable platforms. Two such paradigms are soft-rigid hybrid robots, which utilize rigid structural materials in some aspect of the robot's design, and architectured materials, which deform based on geometric parameters as opposed to purely material ones. In this work, we combine the two design approaches, utilizing trimmed helicoid structures in series with rigid linkages. Additionally, we extend the literature on wave spring-inspired soft structures by deriving a mechanical model of the stiffness for arbitrary geometries. We present a novel manufacturing method for such structures utilizing an injection molding approach and we make available the design tool to generate 3D printed molds for arbitrary designs of this class. Finally, we produce a robot using the above methods and operate it in closed-loop demonstrations.


Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion

Nasr-Esfahany, Arash, Alizadeh, Mohammad, Lee, Victor, Alam, Hanna, Coon, Brett W., Culler, David, Dadu, Vidushi, Dixon, Martin, Levy, Henry M., Pandey, Santosh, Ranganathan, Parthasarathy, Yazdanbakhsh, Amir

arXiv.org Artificial Intelligence

Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Unlike existing simulators and learning approaches that emulate each instruction, Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. It derives these performance distributions using simple analytical models that estimate bounds on performance induced by each microarchitectural component, providing a simple yet rich representation of a program's performance characteristics across a large space of microarchitectural parameters. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator, with about 2% average Cycles-Per-Instruction (CPI) prediction error across a range of SPEC, open-source, and proprietary benchmarks. This enables rapid design-space exploration and performance sensitivity analyses that are currently infeasible, e.g., in about an hour, we conducted a first-of-its-kind fine-grained performance attribution to different microarchitectural components across a diverse set of programs, requiring nearly 150 million CPI evaluations.


Explainable AI-Guided Efficient Approximate DNN Generation for Multi-Pod Systolic Arrays

Siddique, Ayesha, Khalil, Khurram, Hoque, Khaza Anuarul

arXiv.org Artificial Intelligence

Approximate deep neural networks (AxDNNs) are promising for enhancing energy efficiency in real-world devices. One of the key contributors behind this enhanced energy efficiency in AxDNNs is the use of approximate multipliers. Unfortunately, the simulation of approximate multipliers does not usually scale well on CPUs and GPUs. As a consequence, this slows down the overall simulation of AxDNNs aimed at identifying the appropriate approximate multipliers to achieve high energy efficiency with a minimum accuracy loss. To address this problem, we present a novel XAI-Gen methodology, which leverages the analytical model of the emerging hardware accelerator (e.g., Google TPU v4) and explainable artificial intelligence (XAI) to precisely identify the non-critical layers for approximation and quickly discover the appropriate approximate multipliers for AxDNN layers. Our results show that XAI-Gen achieves up to 7x lower energy consumption with only 1-2% accuracy loss. We also showcase the effectiveness of the XAI-Gen approach through a neural architecture search (XAI-NAS) case study. Interestingly, XAI-NAS achieves 40\% higher energy efficiency with up to 5x less execution time when compared to the state-of-the-art NAS methods for generating AxDNNs.